Google Cloud Datalab provides an integrated environment for working with Google BigQuery, for both ad-hoc exploratory work and pipeline development. You've already seen the use of %%bq in the Hello BigQuery notebook, and various commands in the BigQuery Commands notebook.
These commands are in fact built on the same BigQuery APIs that are available for your own use. This notebook introduces some examples of using those APIs.
In [2]:
import google.datalab.bigquery as bq
The most important BigQuery-related API is the one that allows you to execute a SQL query. The google.datalab.bigquery.Query class provides that functionality. To run a query using BigQuery Standard SQL, create a new Query object with the desired SQL string, or use an object that has already been defined by the %%bq query --name command. Let's take a look at an example:
In [4]:
# Create a SQL query object (it is not executed yet)
httplogs_query = bq.Query('SELECT * FROM `cloud-datalab-samples.httplogs.logs_20140615` LIMIT 3')
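For comparison, the same query object could have been defined with the cell magic mentioned above. A sketch (the --name value is just a label of our choosing):
In [ ]:
%%bq query --name httplogs_query
SELECT * FROM `cloud-datalab-samples.httplogs.logs_20140615` LIMIT 3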
Let's run the query created above with caching turned off, so we're sure to be able to retrieve metadata, such as bytes processed, from the resulting query job. For this, we'll need to use a QueryOutput object.
In [5]:
output_options = bq.QueryOutput.table(use_cache=False)
result = httplogs_query.execute(output_options=output_options).result()
result
Out[5]:
The result object is a QueryResultsTable, and can be enumerated in the same manner as a regular Python list, in addition to providing metadata about the result.
In [4]:
# Inspecting the result, and the associated job
print(result.sql)
print(str(result.length) + ' rows')
print(str(result.job.bytes_processed) + ' bytes processed')
In [5]:
# Inspect the programmatic representation.
# Converting the QueryResultsTable to a vanilla list enables viewing the literal data,
# as well as side-stepping the HTML rendering seen above.
list(result)
Out[5]:
The QueryResultsTable has more functionality you can explore, such as converting it to a pandas DataFrame or exporting it to a file (sketched below).
In [7]:
result.to_dataframe()
Out[7]:
In [8]:
type(result.to_dataframe())
Out[8]:
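Exporting to a file, mentioned above, is a one-line call. A minimal sketch, assuming QueryResultsTable inherits Table.to_file(destination, format) from the Datalab API (the local path is illustrative):
In [ ]:
# Write the result rows to a local CSV file; the path is illustrative.
result.to_file('httplogs_sample.csv', format='csv')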
Queries can also be composed with subqueries. Let's define a query that uses a WITH clause to list the distinct names recorded in 2013 in the USA names public dataset:
In [9]:
UniqueNames2013 = bq.Query(sql='''
  WITH UniqueNames2013 AS
    (SELECT DISTINCT name
     FROM `bigquery-public-data.usa_names.usa_1910_2013`
     WHERE Year = 2013)
  SELECT * FROM UniqueNames2013
''')
To execute the query and view a sample from the result table, use a Sampling object. Let's use random sampling for this example:
In [10]:
sampling = bq.Sampling.random(percent=2)
job = UniqueNames2013.execute(sampling=sampling)
job.result()
Out[10]:
Notice that every time we run the query above, we get a different set of results, since we chose a random sampling of 2%.
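If you need a repeatable sample instead, hashed sampling selects rows based on a hash of a field's value, so repeated runs return the same subset. A sketch, assuming bq.Sampling.hashed(field_name, percent) as in the Datalab API:
In [ ]:
# Deterministic 2% sample keyed on the 'name' field
sampling = bq.Sampling.hashed('name', percent=2)
UniqueNames2013.execute(sampling=sampling).result()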
We can also run the query and copy the sampled result into a pandas DataFrame directly. For that, we use a QueryOutput object of type dataframe:
In [11]:
output_options = bq.QueryOutput.dataframe(max_rows=10)
job = UniqueNames2013.execute(output_options=output_options)
In [12]:
job.result()
Out[12]:
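A QueryOutput can also write results straight to a local file. A sketch, assuming bq.QueryOutput.file(path, format) as in the Datalab API (the path is illustrative):
In [ ]:
# Write the query results to a local CSV file; the path is illustrative.
output_options = bq.QueryOutput.file('unique_names_2013.csv', format='csv')
job = UniqueNames2013.execute(output_options=output_options)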
So far we have been working with queries; the API also lets you enumerate the datasets and tables in a project. Here we pass an explicit Context so the listing covers the cloud-datalab-samples project:
In [13]:
from google.datalab import Context
datasets = bq.Datasets(Context('cloud-datalab-samples', Context.default().credentials))
for ds in datasets:
  print(ds.name)
In [14]:
sample_dataset = list(datasets)[1]
tables = sample_dataset.tables()
for table in tables:
  print('%s (%d rows - %d bytes)' % (table.name.table_id, table.metadata.rows, table.metadata.size))
Tables expose their schema too. Let's look at the field names of one of the sample tables:
In [15]:
table = bq.Table('cloud-datalab-samples.httplogs.logs_20140615')
fields = map(lambda tsf: tsf.name, table.schema)
list(fields)
Out[15]:
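Schema fields carry more than just a name. A short sketch printing each field's declared type, assuming each field exposes .name and .type attributes, as in the Datalab schema field class:
In [ ]:
# Print name/type pairs for every field in the table's schema
# (the .type attribute is an assumption about the Datalab field class)
for field in table.schema:
  print('%s: %s' % (field.name, field.type))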
In [16]:
# Create a new dataset (this will be deleted later in the notebook)
sample_dataset = bq.Dataset('apisample')
sample_dataset.create(friendly_name = 'Sample Dataset', description = 'Created from Sample Notebook')
sample_dataset.exists()
Out[16]:
In [41]:
# To create a table, we also need to create a schema.
# We've seen before how to create a schema from some existing data,
# let's now try creating it from a list of records:
schema = [
  {'name': 'name', 'type': 'STRING'},
  {'name': 'value', 'type': 'INT64'},
  {'name': 'flag', 'type': 'BOOL', 'mode': 'NULLABLE'}
]
sample_schema = bq.Schema.from_data(schema)
sample_table = bq.Table("apisample.sample_table").create(schema = sample_schema, overwrite = True)
In [42]:
sample_table.schema
Out[42]:
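With the table in place, you can stream rows into it. A sketch, assuming Table.insert accepts a list of records matching the schema, as in the Datalab API (streaming inserts can take a few seconds to become visible to queries):
In [ ]:
# Stream two sample records into the new table
# (assumes Table.insert takes a list of record dicts)
sample_table.insert([
  {'name': 'first', 'value': 1, 'flag': True},
  {'name': 'second', 'value': 2, 'flag': False}
])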
You can run the cell below to see the contents of the new dataset:
In [ ]:
list(sample_dataset.tables())
In [44]:
# Clear out sample resources
sample_dataset.delete(delete_contents = True)
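As a quick check, the exists() call we used earlier should now return False:
In [ ]:
# Confirm the dataset was deleted
sample_dataset.exists()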